DATA MINING METHOD AND SYSTEM USING REGRESSION CLUSTERING

BACKGROUND 

Field of the Invention 

[0001] The present disclosure generally relates to data mining and, more
specifically, to methods and systems for regressively clustering a dataset.

Background Information

[0002] With the increase in the amount of data being stored in databases as
well as the number of database applications in business and the scientific
domain, the need to efficiently and accurately analyze data is increasing. The
term "data mining" may be used to describe such an analysis of data and may be
referred to herein as the process of identifying and interpreting patterns in
databases. Quick and accurate data mining may offer a variety of benefits for
applications in which data is accumulated. For example, a better understanding
of demand curves within a market may help a business to design multiple models
of a product family for different segments of the market. Similarly, the design of
marketing campaigns and purchase incentive offerings may be more effective
when directed at particular customer segments than when
presented to all customers. In any case, data may, in some embodiments, be
stored at a variety of locations. For example, sales data for a business may be
stored at regional locations. In order to mine the data as a whole, large memory
applications may be needed to gather and process the data, particularly when a
large amount of data is distributed across a plurality of sources.
[0003] Consequently, it would be advantageous to develop systems and
methods for mining data. In particular, it would be advantageous to develop
methods and systems for mining data from datasets distributed across a plurality
of locations.

BRIEF SUMMARY 

[0004] The problems outlined above may be in large part addressed by a
method and a system which regressively cluster datapoints from a plurality of
data sources without transferring data between the plurality of data sources. In
addition, a method and a system are provided which mine data from a dataset by
iteratively applying a regression algorithm and a K-Harmonic Means performance
function on a set number of functions derived from the dataset.

BRIEF DESCRIPTION OF THE DRAWINGS 
[0005] For a detailed description of the exemplary embodiments of the
invention, reference will now be made to the accompanying drawings in which:
[0006] Fig. 1 depicts a schematic diagram of a system for regressively
clustering datapoints from a plurality of data sources;
[0007] Fig. 2 depicts a flow chart of a method for mining data; and
[0008] Fig. 3 depicts a flow chart of a method for compensating for variations
between similar sets of functions within datasets of a plurality of data sources.
[0009] While the invention is susceptible to various modifications and alternative
forms, specific embodiments thereof are shown by way of example in the
drawings and will herein be described in detail. It should be understood,
however, that the drawings and detailed description thereto are not intended to
limit the invention to the particular form disclosed, but on the contrary, the
intention is to cover all modifications, equivalents and alternatives falling within
the spirit and scope of the present invention as defined by the appended claims.

NOTATION AND NOMENCLATURE 
[0010] Certain terms are used throughout the following description and claims to
refer to particular system components. As one skilled in the art will appreciate,
various companies may refer to a component by different names. This document
does not intend to distinguish between components that differ in name but not
function. In the following discussion and in the claims, the terms "including" and
"comprising" are used in an open-ended fashion, and thus should be interpreted
to mean "including, but not limited to...". Also, the term "couple" or "couples" is
intended to mean either an indirect or direct electrical connection. Thus, if a first
device couples to a second device, that connection may be through a direct 
electrical connection, or through an indirect electrical connection via other devices 
and connections. In addition, the term, "data mining," as used herein, may 
generally refer to the process of identifying and interpreting patterns in databases. 

DETAILED DESCRIPTION 
[0011] The following discussion is directed to various embodiments of the 
invention. Although one or more of these embodiments may be preferred, the 
embodiments disclosed should not be interpreted, or otherwise used, as limiting 
the scope of the disclosure, including the claims, unless otherwise specified. In 
addition, one skilled in the art will understand that the following description has 
broad application, and the discussion of any embodiment is meant only to be 
exemplary of that embodiment, and not intended to intimate that the scope of the 
disclosure, including the claims, is limited to that embodiment. 
[0012] Turning now to the drawings, exemplary embodiments of systems and 
methods for mining data from one or more datasets by iteratively applying a 
regression algorithm and a clustering performance function on each of the 
datasets are provided. In particular, system 10 is shown in Fig. 1 and is 
configured to regressively cluster datapoints from a plurality of data sources 12a- 
12c without transferring data between the plurality of data sources. More 
specifically, system 10 is configured to regressively cluster one or more datasets 
within each of data sources 12a-12c individually and, in some embodiments, in 
parallel. As will be discussed in more detail below, an Expectation Maximization 
(EM) objective function, a K-Means (KM) objective function or a K-Harmonic 
Means (KHM) objective function may be used to regressively cluster the datasets 
stored within data sources 12a-12c. Each objective function offers a different 
approach for regressively clustering data and, therefore, at least three distinct 
methods are provided by which system 10 may be configured to regressively
cluster data. Consequently, although an exemplary method for performing
regression clustering using a K-Harmonic Means objective function is illustrated in 
the flowchart of Fig. 2 (discussed below), system 10 is not restricted to using such 
a method for regressively clustering data within data sources 12a-12c. 
[0013] Regardless of the type of objective function used to regressively cluster
data within data sources 12a-12c, system 10 may be configured to collect 
matrices from data sources 12a-12c. The matrices are representative of 
datapoints within the data sources. Such matrices may be used by system 10 to 
determine common coefficient vectors by which to alter functions within the 
datasets of data sources 12a-12c such that variations between similar functions 
of the datasets may be minimized. As a result, system 10 may be configured to 
mine the datasets from data sources 12a-12c as a whole. A more detailed 
description of the method for collecting the matrices and determining the 
coefficient vectors is provided below in reference to the flowchart depicted in 
Fig. 3. 

[0014] In general, data sources 12a-12c may be representative of distributed 
data. Consequently, data sources 12a-12c may include similar variables by 
which to store information, but may have distinct sets of data. For example, each 
of data sources 12a-12c may include sales tracking information, such as gross 
and net profits, number of units sold, and advertising costs. Each of data sources 
12a-12c may respectively represent different regions of the sales organization 
and, therefore, may have different values for each of the variables of the sales 
tracking information. In some embodiments, data sources 12a-12c may work 
independently of each other and, therefore, may not share data other than 
communicating to central station 14 as described in more detail below. In other 
embodiments, however, data sources 12a-12c may be adapted to share some or 
all data. Although system 10 is shown to include three data sources, system 10 
may include any number of data sources, including a single data source, two data 
sources, or more than three data sources. As noted above, system 10 may be 
configured to regressively cluster the datasets within each of data sources 12a- 
12c individually. In some embodiments, such an adaptation may be incorporated 
within program instructions 18 which are executable by processor 20 of central 
station 14 as described in more detail below. In addition or alternatively, the 
adaptation to regressively cluster the datasets within each of data sources 12a- 
12c may be incorporated within the respective data sources. In particular, data 
sources 12a-12c may include storage mediums 26a-26c with program 
instructions 28a-28c which are executable through processors 29a-29c for
regressively clustering data as described below. 

[0015] As noted above, an EM, KM or KHM objective function may be used for 
the regression clustering (RC) process for the datasets of data sources 12a-12c. 
In most cases, the same regression clustering technique is used for all of data 
sources 12a-12c. In other words, the data within data sources 12a-12c is mined 
by an RC process which incorporates one of the EM, KM and KHM objective
functions. In this manner, the datasets within data sources 12a-12c may be
mined as a whole using the same RC process. Regarding the use of the EM, KM
and KHM objective functions, three methods of regression clustering are provided
herein. In each method, a set number of functions, K, may be selected from a
family of functions, $\Phi$, derived from datasets having similar variables by which to
store information. The functions may be selected randomly or by any heuristics 
that are believed to give a good start. The determination of the optimum K may 
include techniques used in the data mining industry for clustering. 
[0016] In embodiments in which Mean-Square Error (MSE) linear regression is 
used in the RC process, selecting the number of K functions may further include 
initializing coefficients, $c_k$, of the functions $\{c_k \mid k = 1, \ldots, K\}$. As will be described in
more detail below, the datasets within data sources 12a-12c are separately
processed with respect to the selected K functions. Information representing the
processed data is collected at a central station and $c_k$ is recalculated to
compensate for the differences between each of the datasets. In general, the first
set of instructions may be conducted by program instructions 18 of storage
medium 16 of central station 14. In this manner, each of data sources 12a-12c
may receive the same select number of functions and coefficients.
Consequently, the first set of instructions may further include propagating the K
number of functions and coefficients $c_k$ to data sources 12a-12c.
[0017] In addition to selecting a set number of functions, each of the regression 
clustering methods described herein may include applying K regression functions. 
$M$ (where $M = \{f_1, \ldots, f_K\} \subset \Phi$), to the data, each finding its own partition, $Z_k$, and
regressing on the partition. The K regression functions are not necessarily linear.
Both parts of the process, i.e., the K regressions and the partitioning of the
dataset, optimize a common objective function. As will be described in more
detail below, the partition of the dataset can be "hard" or "soft." A "hard" partition 
may refer to the designation of every datapoint within a dataset belonging to a 
subset. In this manner, the partitions of the datapoints may be clear and distinct. 
A "soft" partition, however, may refer to the ambiguous groupings of datapoints 
within subsets of a dataset. In some cases, such a categorization of datapoints 
may depend on the probability of datapoints belonging to particular subsets within 
the dataset rather than other subsets. Such a soft-partitioning of data is 
employed by the KHM and EM regression clustering methods as described in
more detail below. 

[0018] The method of regression clustering using a K-Means objective function 
(referred to herein as RC-KM) solves the following optimization problem: 

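$$\min_{\{f_k\} \subset \Phi;\ \{Z_k\}} \; Perf_{RC\text{-}KM} = \sum_{k=1}^{K} \sum_{(x_i, y_i) \in Z_k} e(f_k(x_i), y_i) \tag{1}$$

where Z represents a dataset with supervising responses x and y (i.e.,
$Z = (X, Y) = \{(x_i, y_i) \mid i = 1, \ldots, N\}$) and $Z = \bigcup_{k=1}^{K} Z_k$
($Z_k \cap Z_{k'} = \emptyset$, $k \neq k'$). The optimal partition will satisfy:

$$Z_k = \{(x, y) \in Z \mid e(f_k^{opt}(x), y) \le e(f_{k'}^{opt}(x), y) \;\; \forall k' \neq k\} \tag{2}$$

which allows the replacement of the function in optimization problem (1) to result
in:

$$Perf_{RC\text{-}KM}(Z, \{f_k\}_{k=1}^{K}) = \sum_{i=1}^{N} MIN\{e(f_k(x_i), y_i) \mid k = 1, \ldots, K\} \tag{3}$$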

In other words, RC-KM determines an optimal clustering of datapoints by 
regressing functional relationships of the datapoints to have a minimum amount 
of total variation or error (e). 
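As a minimal sketch of equation (3), the RC-KM objective can be evaluated by summing, for every datapoint, the smallest error among the K functions. The function name `perf_rc_km`, the squared-error choice for e, and the sample data are illustrative assumptions, not part of the disclosure.

```python
def perf_rc_km(points, functions, error=lambda pred, y: (pred - y) ** 2):
    """Equation (3): sum over datapoints of the smallest error among the K functions.

    The squared error used as the default `error` is an assumption of this sketch.
    """
    return sum(min(error(f(x), y) for f in functions) for x, y in points)


# Hypothetical noise-free data drawn from two lines, y = 2x and y = -x + 10.
points = [(0, 0), (1, 2), (2, 4), (0, 10), (1, 9), (2, 8)]
two_lines = [lambda x: 2 * x, lambda x: -x + 10]

print(perf_rc_km(points, two_lines))          # prints 0: every point lies on one of the lines
print(perf_rc_km(points, [lambda x: 2 * x]))  # prints 165: one line cannot fit both groups
```

The second call illustrates why K regression functions may describe such a dataset better than a single regression over the whole dataset.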

[0019] In general, the process of RC-KM may be executed through a monotone-
convergent algorithm to find a local optimum of equation (1). One example of an
RC-KM algorithm may include a first set of instructions for picking a set number of
functions, K, within a dataset as described above. In some embodiments,
selecting the number of K functions may further include initializing coefficients, $c_k$,
of the functions $\{c_k \mid k = 1, \ldots, K\}$. In general, the first set of instructions may be
conducted by program instructions 18 of storage medium 16 of central station 14.
Consequently, the first set of instructions may further include propagating the K
number of functions and coefficients $c_k$ to data sources 12a-12c.

[0020] In addition to selecting K number of functions, the RC-KM algorithm may
include a second set of instructions for repartitioning the dataset in the r-th
iteration, r = 1, 2, ..., as:

$$Z_k^{(r)} = \{(x, y) \in Z \mid e(f_k^{(r-1)}(x), y) \le e(f_{k'}^{(r-1)}(x), y) \;\; \forall k' \neq k\} \tag{4}$$

Such a repartitioning process facilitates a "hard" partition, as defined above.
Each datapoint within the dataset may be associated with the regression function
that results in the smallest approximation error. Using the RC-KM algorithm,
distances between each of the datapoints and the regression functions may be
determined and the errors of fitting the datapoints to the functions are compared.
Algorithmically, for r > 1, a data point in $Z_{k'}^{(r-1)}$ is moved to $Z_k^{(r)}$ if and only if:

a) $e(f_k^{(r-1)}(x), y) < e(f_{k'}^{(r-1)}(x), y)$ and

b) $e(f_k^{(r-1)}(x), y) \le e(f_{k''}^{(r-1)}(x), y)$ for all $k'' \neq k, k'$.

$Z_k^{(r)}$ inherits all the data points in $Z_k^{(r-1)}$ that are not moved. In the event of a tie
between the error functions, the datapoint may be randomly grouped in either
subset.

[0021] In addition to program instructions for function selection and clustering, 
the RC-KM algorithm may include a third set of program instructions for running a 
regression optimization algorithm. In particular, the third set of instructions may 
include an algorithm by which to alter the selected functions to more closely 
represent the datapoints within the respective partitions. In some cases, variable 
selections for the K regressions can be done on each partition independently with 
the understanding that an increase in the value of the objective function could be 
caused by such a process. In any case, the third set of program instructions may 
include any regression optimization algorithm that results in the following: 

$$f_k^{(r)} = \arg\min_{f \in \Phi} \sum_{(x_i, y_i) \in Z_k^{(r)}} e(f(x_i), y_i) \tag{5}$$

where $k = 1, \ldots, K$. In some embodiments, regularization techniques may be
employed to prevent over-fitting of the converged results from the regression 
algorithm. In addition or alternatively, boosting techniques may be used on each
partition independently to improve the quality of the converged results within each
partition. In any case, the regression algorithm may be selected by the nature of
the original problem or other criteria. The fact that it is included in a regression
clustering process adds no additional constraint on its selection.
[0022] In order to cluster the data into the optimum partitions, the second and
third sets of instructions of the RC-KM algorithm may be conducted repeatedly.
Optimally, such a reiterative process continues until no more datapoints
change their membership within the partitions. If any datapoint does change its
partition membership as a result of the second and third sets of instructions, the
value of the objective function in equation (1) decreases. Consequently, the
value of the objective function in equation (1) continues to decrease with each
membership change. As a result, the RC-KM algorithm stops in a finite number of
iterations.
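The alternation of the second and third sets of instructions can be sketched as follows, assuming straight-line regression functions y = a*x + b over scalar data and a squared error. The names `fit_line` and `rc_km` are hypothetical, and ties are resolved here by taking the lowest function index rather than randomly.

```python
def fit_line(pts):
    """Ordinary least-squares line through pts; returns (slope, intercept)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:                       # degenerate partition: fall back to a constant
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in pts) / sxx
    return slope, my - slope * mx


def rc_km(points, coeffs, max_iter=100):
    """Alternate hard repartitioning (eq. 4) and per-partition regression (eq. 5)."""
    coeffs = list(coeffs)
    labels = None
    for _ in range(max_iter):
        # Repartition: each datapoint joins the function with the smallest error.
        new_labels = [
            min(range(len(coeffs)),
                key=lambda k: (coeffs[k][0] * x + coeffs[k][1] - y) ** 2)
            for x, y in points
        ]
        if new_labels == labels:       # no membership change: stop
            break
        labels = new_labels
        # Regress each function on its own partition.
        for k in range(len(coeffs)):
            part = [pt for pt, lab in zip(points, labels) if lab == k]
            if part:                   # keep the old function if its partition is empty
                coeffs[k] = fit_line(part)
    return coeffs, labels


# Hypothetical data from two lines, with a reasonably good initial guess.
pts = [(x, 2 * x) for x in range(5)] + [(x, -x + 10) for x in range(5)]
coeffs, labels = rc_km(pts, [(1.8, 0.5), (-0.8, 9.5)])
print(coeffs, labels)
```

The loop ends when a repartitioning pass changes no labels, which mirrors the stopping condition described above.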
[0023] As noted above, some clustering techniques, such as K-Means 
clustering methods, may be sensitive to the initialization of partition centers. 
Similarly, RC-KM may be sensitive to the initialization of its K functions. More
specifically, the convergence of data into clusters using RC-KM may depend on
how closely the initial set of K functions represents the data, since the datapoints
are partitioned into distinct subsets (i.e., hard partitioned) with respect to the
selected functions during each iteration of the algorithm. In general, the
initialization of the K functions may be dependent on the amount and quality of
available prior information. In many instances, however, there is minimal or no
prior information available regarding the functional relationship of variables within
a dataset. In some cases, more than one functional relationship may be found to
represent a partition of data. As a result, convergence to a distinct set of
partitions may be difficult using RC-KM techniques. In other cases, however, the
initialization of the K functions using RC-KM may be good and, as a result, a
dataset may be clustered into an optimum set of partitions using an RC-KM
algorithm.
[0024] In contrast to K-Means clustering techniques, K-Harmonic Means (KHM)
clustering algorithms are generally less sensitive to the initialization of the K 
functions due to KHM's methods of dynamically weighting data points and its
"soft" partitioning scheme. An exemplary harmonic average based clustering
method is described in U.S. Patent 6,584,433 to Zhang et al. and is incorporated
by reference as if fully set forth herein. Similar to KHM clustering, the
K-Harmonic Means regression clustering process (RC-KHM_p) described herein is
generally less sensitive to the initialization of the K functions as discussed in more
detail below. RC-KHM_p's objective function is defined by replacing the MIN()
function in equation (3) by the harmonic average function, HA(). In addition, the
error function may be represented as $e(f_k(x_i), y_i) = \|f_k(x_i) - y_i\|^p$, where $p \ge 2$. As
a result, the objective function of RC-KHM_p may be:

$$Perf_{RC\text{-}KHM_p}(Z, M) = \sum_{i=1}^{N} HA_{1 \le k \le K}\{\|f_k(x_i) - y_i\|^p\} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{\|f_k(x_i) - y_i\|^p}} \tag{6}$$

In general, different values of parameter p may represent different distance
functions.
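A small sketch may help show how replacing MIN() with the harmonic average changes the objective. The function name `perf_rc_khm`, the scalar data, and the small floor `eps` (used only to avoid division by zero when a point lies exactly on a function) are assumptions of this illustration.

```python
def perf_rc_khm(points, functions, p=2, eps=1e-12):
    """Sum over datapoints of the harmonic average of the K errors |f_k(x) - y|**p."""
    K = len(functions)
    total = 0.0
    for x, y in points:
        # eps is an assumed floor that keeps 1/error finite for exact fits.
        errs = [max(abs(f(x) - y) ** p, eps) for f in functions]
        total += K / sum(1.0 / e for e in errs)
    return total


# One hypothetical datapoint whose errors to the two functions are 1 and 4.
value = perf_rc_khm([(1, 3)], [lambda x: x + 1, lambda x: x], p=2)
print(value)
```

For errors 1 and 4 the harmonic average is 2 / (1/1 + 1/4) = 1.6, which lies between the minimum error and K times the minimum error, so the nearest function dominates without the others being ignored.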

[0025] As noted above, an exemplary method of K-Harmonic Means regression
clustering is depicted in the flowchart of Fig. 2. Such a method is described
herein in reference to an exemplary algorithm for RC-KHM_p. As with RC-KM, RC-
KHM_p may be employed through an algorithm which includes a first set of
instructions for selecting a set number of K functions randomly or by any
heuristics that are believed to give a good start. Such a process is noted as block
30 in Fig. 2. In some embodiments, the step may further include initializing
coefficients, $c_k$, of the functions $\{c_k \mid k = 1, \ldots, K\}$. In general, the first set of
instructions may be conducted by program instructions 18 of storage medium 16
of central station 14. Consequently, the first set of instructions may further
include propagating the K number of functions and coefficients $c_k$ to data sources
12a-12c.

[0026] As noted above, the selected functions may be a subset of a plurality of
functions used to correlate variable parameters of a dataset. In contrast to the
hard partitioning used in RC-KM, RC-KHM_p uses a soft partitioning scheme.
Consequently, datapoints may not be distinctly associated with a single function
when using an RC-KHM_p algorithm. Rather, the RC-KHM_p process may include
determining the distances between each of the datapoints and each of the
functions and computing probability and weighting factors associated with such
distances for each of the datapoints as noted in blocks 32 and 36 in the flowchart
of Fig. 2, respectively. In turn, the RC-KHM_p algorithm may include a second set
of instructions to determine approximate associations of the datapoints to the K
functions based upon the probability and weighting factors. The calculation of the
harmonic averages noted in block 34 may be used in the objective function of
RC-KHM_p as noted in equation (6) above and explained in more detail below. In
general, the calculations of the weighting and probability factors may be
computed by program instructions 28a-28c of storage mediums 26a-26c,
respectively, of data sources 12a-12c. In this manner, the value of the weighting
and probability factors may be dependent on the value of the local datapoints
$z_i \in Z_l$ as well as the values of the "global" or "common" coefficient vectors
$\{c_k \mid k = 1, \ldots, K\}$ in some cases.

[0027] In general, the probability of the i-th data point belonging to a particular
k function may be computed as:

$$p(Z_k \mid z_i) = \frac{d_{i,k}^{-(p+q)}}{\sum_{l=1}^{K} d_{i,l}^{-(p+q)}} \tag{7}$$

wherein:

$$d_{i,k} = \|f_k^{(r-1)}(x_i) - y_i\| \tag{8}$$

The parameter q may be used to put the regression's error function as noted in
equation (10) below in $L^q$-space. In addition, the parameter q may be used to
reduce the association of datapoints to more than one of the selected K functions.
In any case, the weighting factor for each datapoint may be computed using (i.e.,
each data point's participation may be weighted by):

$$a_p(z_i) = \frac{\sum_{l=1}^{K} d_{i,l}^{-(p+q)}}{\left(\sum_{l=1}^{K} d_{i,l}^{-p}\right)^{2}} \tag{9}$$

In this manner, not all datapoints fully participate in all iterations in RC-KHM_p as
they do in RC-KM. As shown in equation (9), the value of weighting function $a_p(z_i)$ for a
particular datapoint depends on the distance between the datapoint and the
function. In particular, the value of weight function $a_p(z_i)$ is smaller when the
datapoint is closer to the function than if the datapoint is farther away from the
function. Weighting function $a_p(z_i)$ changes in each iteration as the regression
functions are updated and, thus, is dynamic. In RC-KM, as described above, and
in RC-EM, as described below, the participation of
each datapoint is not weighted. As such, $a_p(z_i)$ is equal to 1 in RC-KM and RC-
EM as noted below in equations (18) and (22).
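The following sketch computes a probability factor and a weighting factor for one datapoint from its distances to the K functions. It assumes the KHM-style forms obtained by differentiating the harmonic-average objective, namely probabilities proportional to d^-(p+q) and a weight equal to the sum of d^-(p+q) divided by the square of the sum of d^-p. The function name `khm_factors`, the defaults p = 3 and q = 2, and the floor `eps` are illustrative assumptions.

```python
def khm_factors(dists, p=3.0, q=2.0, eps=1e-9):
    """Return (membership probabilities, weighting factor) for one datapoint.

    dists holds the distances from the datapoint to each of the K functions.
    The exponent forms used here are an assumption of this sketch.
    """
    d = [max(x, eps) for x in dists]          # assumed floor to avoid dividing by zero
    inv_pq = [x ** -(p + q) for x in d]
    inv_p = [x ** -p for x in d]
    probs = [v / sum(inv_pq) for v in inv_pq]
    weight = sum(inv_pq) / sum(inv_p) ** 2
    return probs, weight


near_probs, near_weight = khm_factors([0.1, 5.0])   # point close to the first function
far_probs, far_weight = khm_factors([2.0, 5.0])     # point farther from both functions
print(near_probs, near_weight)
print(far_probs, far_weight)
```

Under these assumptions, and with p greater than q, a datapoint that already sits close to a function receives a smaller weight than one that is poorly fitted, so later regressions concentrate on the datapoints that are not yet well explained.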

[0028] As shown in block 38 in the flowchart of Fig. 2, the RC-KHM_p process
may include regressing K functions using the probability and weight factors
computed in block 36. In particular, the RC-KHM_p process may run any
regression optimization algorithm that results in:

$$f_k^{(r)} = \arg\min_{f \in \Phi} \sum_{i=1}^{N} a_p(z_i)\, p(Z_k \mid z_i)\, \|f(x_i) - y_i\|^q \tag{10}$$

where $k = 1, \ldots, K$. For simpler notations, $p(Z_k \mid z_i)$ and $a_p(z_i)$ are not indexed in
equation (10) by q or p. In addition, $d_{i,k}$, $p(Z_k \mid z_i)$, and $a_p(z_i)$ in equations (7),
(8), (9) and (10) are not indexed by the iteration r to simplify notations. As in RC-
KM, variable selections for the K regressions in RC-KHM_p can be done on each
partition independently with the understanding that an increase in the value of the 
objective function could be caused by such a process. In addition, regularization 
techniques and/or boosting techniques may be employed to improve the quality 
of the converged results. In any case, the regression algorithm may be selected 
by the nature of the original problem or other criteria. The fact that it is included in 
a regression clustering process adds no additional constraint on its selection. 
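For q = 2 and straight-line regression functions, the minimization in equation (10) reduces to a weighted least-squares fit, which can be sketched in closed form. The function name `weighted_line_fit` and the sample data are hypothetical; each weight stands for the product of the weighting factor and the probability factor of one datapoint.

```python
def weighted_line_fit(points, weights):
    """Minimise sum_i w_i * (a*x_i + b - y_i)**2 over (a, b) and return (a, b)."""
    sw = sum(weights)
    mx = sum(w * x for w, (x, _) in zip(weights, points)) / sw
    my = sum(w * y for w, (_, y) in zip(weights, points)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, (x, _) in zip(weights, points))
    sxy = sum(w * (x - mx) * (y - my) for w, (x, y) in zip(weights, points))
    a = sxy / sxx
    return a, my - a * mx


# Three points on y = 2x + 1 and one point that belongs elsewhere (weight 0).
pts = [(0, 1), (1, 3), (2, 5), (3, 100)]
a, b = weighted_line_fit(pts, [1.0, 1.0, 1.0, 0.0])
print(a, b)
```

Because the fourth datapoint carries zero weight for this function, it does not disturb the fit, which is the effect the soft partition is meant to approximate.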
[0029] Step 40 includes the reiteration of blocks 34, 36, and 38 for the 
regressed set of functions. More specifically, the RC-KHM_p process involves
determining the distances between each of the data points and the regressed 
functions, calculating harmonic averages of such distances and computing 
probability and weighting factors for the datapoints based upon the determined 
distances. Steps 42, 44, and 46 outline a method for relating the information 
within the dataset, such as the datapoints and the probability and weighting 
factors, with dataset information from other data sources. In other words, blocks 
42, 44 and 46 outline a scheme for regressively clustering data distributed across 
several distinct data sources. In this manner, the method depicted in Fig. 2 is
configured to regressively cluster distributed data in parallel and as a whole. A 
more detailed description of such a process is provided below in reference to 
Fig. 3. 

[0030] Referring to block 48, the RC-KHM_p process may include computing a
change in harmonic averages for the K functions prior to and subsequent to the
regressing process described in reference to block 38. Such a computation may
be included within the objective function for RC-KHM_p as cited in equation (6)
above. Step 50 may be used to determine if the change in harmonic averages is
greater than a predetermined value. More specifically, since there is no discrete
membership change in RC-KHM_p, the continuation or termination of the method
may be determined by measuring the changes to the RC-KHM_p objective function
(i.e., equation (6)). For example, in embodiments in which the change in
harmonic average (i.e., the objective function) is greater than the predetermined
value, the method may revert back to block 32 and determine distances between
datapoints of the dataset and values correlated with the new set of functions
computed from blocks 40-46. The method may subsequently follow
blocks 34-50 and, thus, provide an iterative process until the change in harmonic
averages is reduced to a value below the predetermined level noted in block 50.
As shown in Fig. 2, upon determining the change in harmonic averages (i.e., the
objective function) is less than the predetermined value, the method may
terminate. Alternatively, the method may be
terminated when the value of the objective function is less than a predetermined
value.
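The iteration and its stopping rule can be sketched for a single dataset as follows. The sketch assumes straight-line functions, q = 2, and combined weights of the form d^-(p+q) divided by the square of the sum of d^-p; the names `rc_khm` and `weighted_line_fit`, the tolerance `tol`, and the floor `eps` are hypothetical. The distributed steps of blocks 42-46 are omitted.

```python
def weighted_line_fit(points, weights):
    """Weighted least-squares line; returns (slope, intercept)."""
    sw = sum(weights)
    mx = sum(w * x for w, (x, _) in zip(weights, points)) / sw
    my = sum(w * y for w, (_, y) in zip(weights, points)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, (x, _) in zip(weights, points))
    sxy = sum(w * (x - mx) * (y - my) for w, (x, y) in zip(weights, points))
    a = sxy / sxx
    return a, my - a * mx


def rc_khm(points, coeffs, p=3.0, q=2.0, tol=1e-6, max_iter=200, eps=1e-9):
    """Iterate distance, weighting and regression steps until the objective settles."""
    coeffs = list(coeffs)
    K = len(coeffs)

    def distances(cs):                      # distances of every point to every line
        return [[max(abs(a * x + b - y), eps) for a, b in cs] for x, y in points]

    def objective(dist):                    # sum of harmonic averages of d**p
        return sum(K / sum(d ** -p for d in row) for row in dist)

    dist = distances(coeffs)
    perf = objective(dist)
    for _ in range(max_iter):
        # Assumed combined factor: weight times probability for each point and function.
        w = [[d ** -(p + q) / sum(e ** -p for e in row) ** 2 for d in row]
             for row in dist]
        coeffs = [weighted_line_fit(points, [row[k] for row in w]) for k in range(K)]
        dist = distances(coeffs)
        new_perf = objective(dist)
        change = abs(perf - new_perf)
        perf = new_perf
        if change < tol:                    # change below the predetermined value
            break
    return coeffs, perf


pts = [(x, 2 * x) for x in range(5)] + [(x, -x + 10) for x in range(5)]
coeffs, perf = rc_khm(pts, [(1.8, 0.5), (-0.8, 9.5)])
print(coeffs, perf)
```

Since no datapoint changes membership discretely, the loop monitors the change in the objective rather than a change in labels.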

[0031] Referring to an RC-EM process, the objective function is defined as: 

$$Perf_{RC\text{-}EM}(Z, M) = -\log\left\{ \prod_{i=1}^{N} \sum_{k=1}^{K} \frac{p_k}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\left(-\frac{1}{2}\,(f_k(x_i) - y_i)\, \Sigma_k^{-1}\, (f_k(x_i) - y_i)^T\right) \right\} \tag{11}$$

where d = dimension(Y). In the case in which d = 1, $(f_k(x_i) - y_i)$ is a real number
and $\Sigma_k^{-1} = 1/\sigma_k^2$. An exemplary RC-EM algorithm may include a first set of
instructions to select a set number of K functions, as described in reference to
RC-KM and RC-KHM_p. In some embodiments, the first set of instructions may
further include instructions for initializing coefficients, $c_k$, of the functions
$\{c_k \mid k = 1, \ldots, K\}$ as described above. In general, the first set of instructions may be
conducted by program instructions 18 of storage medium 16 of central station 14.
Consequently, the first set of instructions may further include propagating the K
number of functions and coefficients $c_k$ to data sources 12a-12c. In addition to
function selection, the RC-EM algorithm may include two steps by which to
regressively cluster a dataset. In particular, the RC-EM algorithm may include an
expectation step (E-Step) and a maximization step (M-Step).
[0032] In general, the E-Step may be used to determine how much of each 
datapoint is related to each subset. Such a step may be conducted by computing 
a probability factor in which: 

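$$p(Z_k^{(r)} \mid z_i) = \frac{\dfrac{p_k^{(r-1)}}{\sqrt{|\Sigma_{r-1,k}|}} \exp\left(-\frac{1}{2}\,(f_k^{(r-1)}(x_i) - y_i)\, \Sigma_{r-1,k}^{-1}\, (f_k^{(r-1)}(x_i) - y_i)^T\right)}{\sum_{k'=1}^{K} \dfrac{p_{k'}^{(r-1)}}{\sqrt{|\Sigma_{r-1,k'}|}} \exp\left(-\frac{1}{2}\,(f_{k'}^{(r-1)}(x_i) - y_i)\, \Sigma_{r-1,k'}^{-1}\, (f_{k'}^{(r-1)}(x_i) - y_i)^T\right)} \tag{12}$$

The M-Step may use such a probability factor to regress the selected functions of
the dataset. In particular, the M-step may use the following equations to regress
the functions of a dataset:

$$p_k^{(r)} = \frac{1}{N} \sum_{i=1}^{N} p(Z_k^{(r)} \mid z_i) \tag{13}$$

$$f_k^{(r)} = \arg\min_{f \in \Phi} \sum_{i=1}^{N} p(Z_k^{(r)} \mid z_i)\, \|f(x_i) - y_i\|^2 \tag{14}$$

$$\Sigma_{r,k} = \frac{\sum_{i=1}^{N} p(Z_k^{(r)} \mid z_i)\, (f_k^{(r)}(x_i) - y_i)^T (f_k^{(r)}(x_i) - y_i)}{N\, p_k^{(r)}} \tag{15}$$

The E-Step and M-Step may be conducted in an iterative process. As with RC-
KM, RC-EM may be sensitive to the initialization of functions and, consequently,
may have difficulty in converging the datapoints to an optimal set of subsets in
some embodiments. In other cases, however, the initialization of functions within
a dataset may be good and the dataset may be clustered into an optimum set of
partitions using an RC-EM algorithm.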
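One E-Step followed by one M-Step can be sketched for scalar responses (d = 1) and straight-line functions. The names `rc_em_step` and `weighted_line_fit` are hypothetical, and the variance floor `min_var` is an added assumption that keeps the next E-Step well defined when a function fits its datapoints exactly.

```python
import math


def weighted_line_fit(points, weights):
    """Weighted least-squares line; returns (slope, intercept)."""
    sw = sum(weights)
    mx = sum(w * x for w, (x, _) in zip(weights, points)) / sw
    my = sum(w * y for w, (_, y) in zip(weights, points)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, (x, _) in zip(weights, points))
    sxy = sum(w * (x - mx) * (y - my) for w, (x, y) in zip(weights, points))
    a = sxy / sxx
    return a, my - a * mx


def rc_em_step(points, coeffs, priors, variances, min_var=1e-6):
    """One E-Step and M-Step for d = 1; returns (coeffs, priors, variances, resp)."""
    N, K = len(points), len(coeffs)
    # E-Step: probability of each datapoint belonging to each function.
    resp = []
    for x, y in points:
        dens = [priors[k] / math.sqrt(variances[k])
                * math.exp(-0.5 * (coeffs[k][0] * x + coeffs[k][1] - y) ** 2 / variances[k])
                for k in range(K)]
        total = sum(dens)
        resp.append([v / total for v in dens])
    # M-Step: update mixing proportions, functions and variances.
    new_coeffs, new_priors, new_vars = [], [], []
    for k in range(K):
        w = [row[k] for row in resp]
        pk = sum(w) / N
        a, b = weighted_line_fit(points, w)
        var = sum(wi * (a * x + b - y) ** 2 for wi, (x, y) in zip(w, points)) / (N * pk)
        new_coeffs.append((a, b))
        new_priors.append(pk)
        new_vars.append(max(var, min_var))   # assumed floor, not part of the text
    return new_coeffs, new_priors, new_vars, resp


# Hypothetical, well-separated data from y = 2x and y = 2x + 20.
pts = [(x, 2 * x) for x in range(4)] + [(x, 2 * x + 20) for x in range(4)]
coeffs, priors, variances, resp = rc_em_step(
    pts, [(1.5, 1.0), (2.5, 19.0)], [0.5, 0.5], [1.0, 1.0])
print(coeffs, priors, variances)
```

In practice the step would be repeated until the objective of equation (11) stops improving; a single step is shown here to keep the sketch short.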
[0033] As noted above, a system and a method are provided which are
configured to regressively cluster distributed data both in parallel and as a whole.
In particular, Fig. 1 illustrates system 10 which is configured to regressively 
cluster the data within data sources 12a-12c in parallel without transferring data 
between the data sources. In addition, system 10 is configured to regressively 
cluster the datasets within data sources 12a-12c as a whole through central 
station 14. As described above, the methods provided herein may include 
regressively clustering data through the use of one or more algorithms and, 
therefore, may be best implemented through a computer. Consequently, central 
station 14 may be a computer in some cases. In addition, the methods described
herein may, in some embodiments, be referred to as "computer-implemented
methods." In other cases, however, the methods described herein may be more
generally referred to as "methods." The use of the two terms is not mutually
exclusive and, therefore, the terms may be used interchangeably herein.
[0034] In general, central station 14 may be communicably coupled to data 
sources 12a-12c such that input 22 may be received from the data sources and 
output 24 may be sent to the data sources. More specifically, input 22 may be
transmitted from data sources 12a-12c to central station 14 to execute program 
instructions 18 within storage medium 16. Similarly, input may be transmitted to 
data sources 12a-12c from central station 14 or any other data input source to 
execute program instructions 28a-28c within storage mediums 26a-26c. Storage 
mediums 16 and 26a-26c may include any device for storing program 
instructions, such as a read-only memory, a random access memory, a magnetic 
or optical disk, or a magnetic tape. Program instructions 18 and 28a-28c may 
include any instructions by which to perform the processes for RC-KM, RC-KHM 
and RC-EM described above. In particular, program instructions 18 and 28a-28c 
may include instructions for selecting a set number of functions correlating 
variable parameters of a dataset and other instructions for clustering the dataset 
through the iteration of a regression algorithm and a KM, KHM or EM 
performance function applied to the set number of functions as described above.
[0035] In addition, program instructions 18 and 28a-28c may include 
instructions for collecting dataset information from data sources 12a-12c to 
regressively cluster the datasets therefrom as a whole. A more detailed
description of such program instructions is provided in reference to the flowchart
illustrated in Fig. 3 as well as blocks 42, 44 and 46 in Fig. 2. More specifically,
Fig. 3 depicts a set of processes which may be performed at central station 14
with respect to clustering the datasets from data sources 12a-12c as a whole.
Steps 42, 44 and 46 in Fig. 2 depict a set of steps which may be performed at
each of data sources 12a-12c with respect to sending and receiving information
from central station 14 for the method described in reference to Fig. 3. As shown
in block 42 of Fig. 2, matrices may be developed for each of data sources 12a-
12c from the datapoints and the probability and weighting factors associated with
the datasets therein. Such a process may be executed by program instructions 
18 included within storage medium 16 of central station 14 or program 
instructions 28a-28c included within storage mediums 26a-26c of data sources 
12a-12c. 

[0036] In either case, the matrices developed from data sources 12a-12c may 
be set forth as $A_{l,k}$ and $b_{l,k}$, 

$$A_{l,k} = X_l^T \,\mathrm{diag}(w_{l,k})\, X_l, \qquad b_{l,k} = X_l^T \,\mathrm{diag}(w_{l,k})\, Y_l, \qquad k = 1, \ldots, K \qquad (16)$$

where the dataset $(X, Y)$ is located on $L$ data sources 12a-12c, $(X_l, Y_l)$, $l = 1, \ldots, L$, 
is the subset on the $l$-th computer and the size of $(X_l, Y_l)$ is $N_l$. The diagonal 
matrix is $\mathrm{diag}(w_{l,k}) = \mathrm{diag}\{a(z_i)\, p(Z_k \mid z_i) \mid i \in \text{subset of indices of the datapoints in the } l\text{-th computer}\}$, 
with $a(z_i)$ and $p(Z_k \mid z_i)$ defined with respect to the type of 
regression clustering technique used to cluster the dataset. In particular, $a(z_i)$ 
and $p(Z_k \mid z_i)$ may be defined as noted below with respect to using an RC-KM, 
RC-KHM or RC-EM technique to cluster the data. 
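RC-KM: 

$$p(Z_k \mid z_i) = \begin{cases} 1, & k = \arg\min_{k'} \{ \| f_{k'}(x_i) - y_i \|^2 \} \\ 0, & k \neq \arg\min_{k'} \{ \| f_{k'}(x_i) - y_i \|^2 \} \end{cases} \qquad (17)$$

$$a(z_i) = 1 \qquad (18)$$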
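RC-KHM: 

$$p(Z_k \mid z_i) = \frac{d_{i,k}^{-(p+2)}}{\sum_{k'=1}^{K} d_{i,k'}^{-(p+2)}} \qquad (19)$$

$$a(z_i) = \frac{\sum_{k'=1}^{K} d_{i,k'}^{-(p+2)}}{\left( \sum_{k'=1}^{K} d_{i,k'}^{-p} \right)^{2}} \qquad (20)$$

where $d_{i,k} = \| f_k(x_i) - y_i \|$ and $p$ in the exponents is the power of the harmonic average. 

RC-EM: 

$$p(Z_k \mid z_i) = \frac{\dfrac{p_k}{\sqrt{|\Sigma_k|}} \exp\!\left( -\tfrac{1}{2} (f_k(x_i) - y_i)\, \Sigma_k^{-1}\, (f_k(x_i) - y_i)^T \right)}{\sum_{k'=1}^{K} \dfrac{p_{k'}}{\sqrt{|\Sigma_{k'}|}} \exp\!\left( -\tfrac{1}{2} (f_{k'}(x_i) - y_i)\, \Sigma_{k'}^{-1}\, (f_{k'}(x_i) - y_i)^T \right)} \qquad (21)$$

$$a(z_i) = 1 \qquad (22)$$

where $p_k$ and $\Sigma_k$ are the mixing probability and covariance associated with the $k$-th function. 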
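By way of non-limiting illustration, the computations that a single data source may perform to obtain the matrices of equation (16) can be sketched in Python with NumPy. The function names, the `eps` guard and the default exponent `p_exp` are hypothetical and do not appear in the disclosure. The RC-KHM branch assumes the standard K-Harmonic Means membership and weighting with residual d[i, k] = |f_k(x_i) - y_i| and a squared-error regression, and a scalar response variable is assumed throughout.

```python
import numpy as np

def rc_km_weights(residuals):
    """Hard membership for RC-KM: p = 1 for the closest function, a = 1.

    residuals: (N, K) array, residuals[i, k] = |f_k(x_i) - y_i|.
    """
    n, k = residuals.shape
    p = np.zeros((n, k))
    p[np.arange(n), np.argmin(residuals, axis=1)] = 1.0
    return p, np.ones(n)

def rc_khm_weights(residuals, p_exp=2.0, eps=1e-12):
    """Soft membership and dynamic weighting for RC-KHM (assumed standard form)."""
    d = np.maximum(residuals, eps)          # guard against zero residuals
    inv_pq = d ** -(p_exp + 2.0)            # d^-(p+2)
    inv_p = d ** -p_exp                     # d^-p
    p = inv_pq / inv_pq.sum(axis=1, keepdims=True)
    a = inv_pq.sum(axis=1) / inv_p.sum(axis=1) ** 2
    return p, a

def local_matrices(X_l, Y_l, p, a):
    """A[k] = X_l^T diag(w_k) X_l and b[k] = X_l^T diag(w_k) Y_l, with w_k = a * p[:, k]."""
    w = p * a[:, None]                                  # (N_l, K)
    A = np.einsum('nk,nd,ne->kde', w, X_l, X_l)         # (K, D, D)
    b = np.einsum('nk,nd,n->kd', w, X_l, Y_l)           # (K, D)
    return A, b
```

Only the arrays `A` and `b` returned by `local_matrices` would need to leave the data source; the datapoints themselves remain local.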
[0037] The matrices may be collected from data sources 12a-12c at central 
station 14 through input 22 as outlined in block 60 of Fig. 3. Consequently, the 
method depicted in Fig. 2 may include block 44 in which the matrices from one 
data source are combined with matrices from others of data sources 12a-12c. Such 
a transfer of information may be initiated by program instructions 18 included 
within central station 14 and/or program instructions 28a-28c within data sources 
12a-12c. As shown in Fig. 3, the method may further include block 62 in which 
common coefficient vectors are computed from the composite of matrices. Such 
a calculation may be performed at central station 14 by program instructions 18. 
The common coefficient vectors may be sent to data sources 12a-12c as shown in 
block 64 of Fig. 3 and multiplied by the respective regression functions of each 
data source as noted in block 46 in Fig. 2. In general, the common coefficient 
vectors computed in block 62 may be used to compensate for variations between 
similar sets of functions within the datasets of data sources 12a-12c. More 
specifically, the common coefficient vectors may be used to compensate for 
variations between sets of functions having similar response variables. 
[0038] In some embodiments, a total residue error of the common coefficient 
vectors may be computed between iterations of the regression clustering 
process. In particular, the variation of $c_k$ between iterations may be calculated at 
central station 14 to determine whether to continue the regression clustering 
process. Such a computation may offer a manner in which to monitor the 
progress of the regression clustering process in addition to the computation of the 
change in harmonic averages at each of data sources 12a-12c as described 
above in reference to block 50. In particular, the regression clustering process 
may be terminated upon detecting changes in coefficient values which are less 
than a predetermined value. In addition, the regression clustering process may 
continue to block 50 upon detecting changes in the coefficient values which are greater 
than a predetermined value. The residue error calculation may be conducted 
prior or subsequent to block 64 in Fig. 3 in which the coefficient vectors are sent 
to each of the data sources. 
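As one possible illustration of the termination test described above, the following Python sketch sums the change in each coefficient vector between two iterations and compares the total with a threshold. The function name and the default threshold `tol` are hypothetical, and the use of a Euclidean norm for the per-vector change is an assumption rather than a requirement of the method.

```python
import numpy as np

def coefficients_converged(c_prev, c_new, tol=1e-6):
    """Return True when the total change in the K coefficient vectors is below tol.

    c_prev, c_new: (K, D) arrays holding the coefficient vectors of two
    successive iterations.
    """
    c_prev = np.asarray(c_prev, dtype=float)
    c_new = np.asarray(c_new, dtype=float)
    # Euclidean change of each vector, summed over the K functions.
    total_change = np.linalg.norm(c_new - c_prev, axis=1).sum()
    return bool(total_change < tol)
```

In such a sketch the central station would stop the iteration when the function returns True and would otherwise send the new coefficient vectors to the data sources.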

[0039] The optimal common coefficient vectors, $c_k$, may be calculated at central 
station 14 by summing the matrices received from data sources 12a-12c such 
that 

$$A_k = X^T \,\mathrm{diag}(w_k)\, X = \sum_{l=1}^{L} X_l^T \,\mathrm{diag}(w_{l,k})\, X_l, \qquad b_k = X^T \,\mathrm{diag}(w_k)\, Y = \sum_{l=1}^{L} X_l^T \,\mathrm{diag}(w_{l,k})\, Y_l, \qquad k = 1, \ldots, K \qquad (23)$$

and using such summed matrices to compute $c_k$ as: 

$$c_k = A_k^{-1} b_k, \qquad k = 1, \ldots, K. \qquad (24)$$
Although such a computation does involve the transfer of information between 
data sources 12a-12c and central station 14, the amount of data transferred is 
significantly smaller than the size of each dataset on data sources 12a-12c. 
Choosing D functions as a basis, $A_{l,k}$ is a D×D matrix and $b_{l,k}$ is a D-dimensional 
vector. The total number of floating point numbers to be transmitted 
from each of data sources 12a-12c to central station 14 is $D^2 + D$ for each of the K functions. The total size 
of all the coefficients $c_k$, which are transmitted back from central station 14 to all 
of data sources 12a-12c, is D×K floating point numbers. All these sizes are 
minute compared with the size of the dataset on each of data sources 12a-12c. 
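The summation of equation (23) and the solution of equation (24) may be sketched, purely as an example, in Python with NumPy. The function names are hypothetical. A least-squares solve is used in place of an explicit matrix inverse so that a singular $A_k$ (for example, a function to which no datapoint is assigned) does not raise an error; that substitution is an assumption of the sketch. The transfer counts read the $D^2 + D$ figure as applying to each of the K functions.

```python
import numpy as np

def global_coefficients(local_A, local_b):
    """Combine per-source matrices and solve for the common coefficient vectors.

    local_A: list with one (K, D, D) array per data source.
    local_b: list with one (K, D) array per data source.
    Returns a (K, D) array whose k-th row solves A_k c_k = b_k.
    """
    A = np.sum(local_A, axis=0)   # A_k = sum over sources of A_{l,k}
    b = np.sum(local_b, axis=0)   # b_k = sum over sources of b_{l,k}
    return np.stack([np.linalg.lstsq(A_k, b_k, rcond=None)[0]
                     for A_k, b_k in zip(A, b)])

def floats_sent_per_source(D, K):
    """Floating point numbers one data source sends per iteration."""
    return K * (D * D + D)

def floats_returned(D, K):
    """Floating point numbers the central station sends back per iteration."""
    return D * K
```

For instance, with D = 3 basis functions and K = 2 regression functions, each data source would send 24 numbers per iteration and receive 6, regardless of how many datapoints it stores.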
[0040] Table 1 provided below includes an exemplary layout of the processes 
included in the RC process described herein as well as which entity (i.e., data 
sources 12a-12c or central station 14) may be used for each step. Such a 
general outline of the RC process may be applied to any of the RC techniques 
provided herein (i.e., RC-KM, RC-KHM and RC-EM). In some cases, the RC 
process may include processes other than those included in Table 1 and 
described above. As such, the RC process is not necessarily restricted to the 
process outlined in Table 1. In addition, the designation of the processes to be 
conducted by central station 14 or data sources 12a-12c in Table 1 may be 
reversed or alternatively conducted by both entities, in some embodiments. 



Table 1 

Step 1: Initialization (Calculations on Central Station 14) 

    a) Pick K functions $f_k^{(0)} \in \Phi$, $k = 1, \ldots, K$, and, in some cases, 
       initialize the coefficients $\{c_k \mid k = 1, \ldots, K\}$ randomly, or by any other heuristic. 

    b) Propagate the functions/coefficients to data sources 12a-12c. 

Step 2: Clustering (Calculations on Data Sources 12a-12c). In the r-th iteration, 

    a) Calculate the probability $p(Z_k \mid z_i)$ of K functions, and 

    b) Optionally, calculate the dynamic weighting factor $a(z_i)$. 

Step 3: Regression (Calculations on Data Sources 12a-12c) 

    a) Regress the K functions with regard to the weighting and probability factors. 

    b) Calculate $w_{l,k,i} = p(Z_k \mid z_i)\, a(z_i)$, $z_i \in Z_l$. 

    c) Calculate 
       $A_{l,k} = X_l^T \,\mathrm{diag}(w_{l,k})\, X_l$, 
       $b_{l,k} = X_l^T \,\mathrm{diag}(w_{l,k})\, Y_l$, $k = 1, \ldots, K$. 

    d) Send information set $\{A_{l,k}, b_{l,k} \mid k = 1, \ldots, K\}$ to central station 14. 

Step 4: Global Coefficient Calculation (Calculations on Central Station 14) 

    a) Calculate the summation: 
       $A_k = \sum_{l=1}^{L} A_{l,k}$, $b_k = \sum_{l=1}^{L} b_{l,k}$, $k = 1, \ldots, K$. 

    b) Calculate: $A_k^{-1}$, $k = 1, \ldots, K$. 

    c) Calculate the global coefficients: 
       $c_k = A_k^{-1} b_k$, $k = 1, \ldots, K$. 

Step 5: Residual Error Check (Calculations on Central Station 14) 

    a) Check the change on the total residue error. 

    b) Propagate the new coefficient set $\{c_k \mid k = 1, \ldots, K\}$ to data sources 12a-12c. 

Step 6: Process Continuation (Calculations on Data Sources 12a-12c) 

    a) Stop the iterative process when a termination message is received from 
       central station 14; OR 

    b) Compute a change in harmonic averages for the previously selected functions. 

    c) Repeat steps 2-6 when the change in harmonic averages is above a predetermined 
       value and stop the iterative process when the change in harmonic averages is less 
       than a predetermined value. 
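The division of work between the data sources and the central station can be illustrated with a minimal single-process Python sketch of the RC-KM variant, in which each "data source" is simply a pair of NumPy arrays. All names, the small `ridge` term that keeps each $A_k$ invertible, and the thresholds are hypothetical. The sketch stops on the change in coefficients checked at the central station and omits the harmonic-average check at the data sources; a scalar response variable is assumed.

```python
import numpy as np

def local_step(X_l, Y_l, coeffs):
    """Clustering and regression steps at one data source (RC-KM: hard assignment, a = 1)."""
    resid = np.abs(X_l @ coeffs.T - Y_l[:, None])            # (N_l, K) residuals
    p = np.zeros_like(resid)
    p[np.arange(len(Y_l)), resid.argmin(axis=1)] = 1.0       # closest function wins
    A = np.einsum('nk,nd,ne->kde', p, X_l, X_l)              # (K, D, D)
    b = np.einsum('nk,nd,n->kd', p, X_l, Y_l)                # (K, D)
    return A, b                                              # only these leave the source

def central_step(mats, ridge=1e-9):
    """Global coefficient calculation at the central station."""
    A = sum(m[0] for m in mats)
    b = sum(m[1] for m in mats)
    D = A.shape[1]
    # Assumed safeguard: a tiny ridge keeps A_k invertible if a cluster is empty.
    return np.stack([np.linalg.solve(A_k + ridge * np.eye(D), b_k)
                     for A_k, b_k in zip(A, b)])

def distributed_rc_km(sources, init_coeffs, tol=1e-8, max_iter=100):
    """Iterate the local and central steps until the coefficients stop changing."""
    coeffs = np.asarray(init_coeffs, dtype=float)
    for _ in range(max_iter):
        mats = [local_step(X_l, Y_l, coeffs) for X_l, Y_l in sources]
        new_coeffs = central_step(mats)
        if np.abs(new_coeffs - coeffs).sum() < tol:          # residual error check
            return new_coeffs
        coeffs = new_coeffs
    return coeffs
```

A call such as `distributed_rc_km([(X1, Y1), (X2, Y2)], init)` takes one `(X_l, Y_l)` pair per data source, where each row of `X_l` holds the D basis-function values for a datapoint and `init` is a (K, D) array of starting coefficients.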



[0041] The above discussion is meant to be illustrative of the principles and 
various embodiments of the present invention. Numerous variations and 
modifications will become apparent to those skilled in the art once the above 
disclosure is fully appreciated. For example, the systems and methods described 
herein may be incorporated within any type of data system, including those with 
distributed data and non-distributed data. It is intended that the following claims 
be interpreted to embrace all such variations and modifications. 